knitr::opts_chunk$set(
    echo = FALSE,
    fig.height = 4.5,
    fig.width = 8,
    message = FALSE,
    warning = FALSE
)
library(DML)
library(tidyverse)
library(magrittr)
library(Rtsne)
library(gridExtra)
library(cluster)
library(factoextra)

# results <- get(params$results)

Introduction to k-Means clustering

For those unfamiliar, we will outline key information regarding clustering, and in particular k-Means clustering, with examples where relevant.

What is clustering?

Clustering is a technique used to group similar data points together, such that each point is more similar to the members of its own cluster than to those of the other clusters. It is an "unsupervised" machine learning technique, meaning that we don't have a specified output that we want to model. Such techniques attempt to find previously unidentified patterns within the data, as opposed to a "supervised" machine learning algorithm, which attempts to describe an identified pattern. This makes our task more difficult, as we don't have a benchmark against which to compare our results in order to determine whether they are a good description of the data. However, we can judge the suitability of our results by looking at the differences between our clusters across our data and at how this helps us understand the potential patterns within it, and therefore the potential actions we can take.

When should we use clustering?

Clustering is often used as a data exploration tool or when we want to separate our data into groups (e.g. Customer Segmentation, Fraud Detection, Credit Scoring). The aim of clustering is to create groups of data points that have similar behaviour across a selected set of variables and differing behaviour when comparing clusters, i.e. similarity within clusters, dissimilarity between clusters. One form of clustering is to create a set of rules to determine how to group the data (e.g. Over 45 & Employed, 35-50 & Self-Employed); however, if we don't know the rules that we want to use, we can use algorithms to identify how to partition the data into a predefined number of clusters, $k$.

What is k-Means?

"k-Means" is a clustering algorithm which creates groups based upon a data point's distance to the average of the group, placing a data point into the group where the difference between its value and the group average is smallest; the aim being, as with all clustering, to find clusters that group the data in such a way where the data points are similar to each other. The algorithm repeats and updates the group averages, and therefore the group membership, until a stopping condition is met or the group average does not significantly change. As part of the first step in the algorithm random data points are chosen as the initial group averages, as such we can expect different results each time the algorithm is run unless we provide a starting \"seed\" which will generate the same starting point. The quality of the final clusters will be dependent upon the properties of the data matching the assumptions of the algorithm and if these aren't met then another clustering algorithm may be a better choice, or we should look at ways in which we can transform our data to make the assumptions valid.

When should we use k-Means?

When determining which algorithm or technique to use we often need to check whether the data we have is suitable. With k-Means, the data needs to meet the following criteria or assumptions:

1) Variables should be continuous.

We shouldn't use integer data with k-Means as we might not be able to interpret the results. Whilst a column may contain integers, i.e. 1, 4, 3, 2..., the cluster centres may be continuous, i.e. 2.3, which is a different data type. Even where this is not a problem in itself, the way in which cluster distances are calculated with integer data means the algorithm may also fail to find a stable solution. It is generally advised not to use integer data in k-Means without some kind of aggregation/transformation with other integer data, such as factor analysis.

2) All variables should have the same variance.

If one variable has a higher variance than the others it will have a stronger influence on the clustering. We can see in the example below that when the variance of one variable is much higher the algorithm struggles to identify distinct clusters within the data; once the data is scaled it can identify this pattern.

# varianceexplained.org/r/kmeans-free-lunch/
# Simulate three clusters where y has a much larger variance than x
sizes <- c(100, 100, 100)
centers <- tibble(x = c(1, 4, 6),
                  y = c(5, 1.5, 6) * 10,
                  n = sizes,
                  cluster = factor(1:3))
set.seed(314159265)
data <- centers %>%
  group_by(cluster) %>%
  do(tibble(x = rnorm(.$n, .$x, 1), y = rnorm(.$n, .$y, 6))) %>%
  ungroup() %>%
  select(-cluster)

# Cluster the raw data and the scaled data, each with k = 3
unscaled <- kmeans(data, 3)
use <- data %>% mutate_all(scale)
scaled <- kmeans(use, 3)

plottingData <- bind_rows(
  data %>%
    mutate('Cluster' = as.factor(c('B', 'A', 'C')[unscaled$cluster]),
           'Type' = 'Unscaled'),
  use %>%
    mutate('Cluster' = as.factor(LETTERS[scaled$cluster]),
           'Type' = 'Scaled')
) %>%
  mutate('Type' = factor(Type, levels = c('Unscaled', 'Scaled')))

ggplot(plottingData) +
  geom_point(aes(x = x, y = y, colour = Cluster)) +
  facet_wrap(. ~ Type, scales = 'free') +
  theme_bw()

3) The data should be spherically distributed.

Similarly to the first assumption, this requirement comes from how the algorithm calculates the distance between points. If the data is not normally/spherically distributed, as in the figure below (see above for spherical data), then the algorithm cannot identify the clusters appropriately. In such cases we can investigate transformations of the data that would allow us to build suitable clusters using k-Means.

# varianceexplained.org/r/kmeans-free-lunch/
# Simulate a central blob surrounded by two rings - clearly non-spherical clusters
n <- 250
set.seed(314159265)
c1 <- tibble(x = rnorm(n), y = rnorm(n), cluster = 1) %>%
  dplyr::select(x, y)
c2 <- tibble(r = rnorm(n, 5, .25), theta = runif(n, 0, 2 * pi),
             x = r * cos(theta), y = r * sin(theta), cluster = 2) %>%
  dplyr::select(x, y)
c3 <- tibble(r = rnorm(n, 7, .1), theta = runif(n, 4, 6 * pi),
             x = r * cos(theta), y = r * sin(theta), cluster = 2) %>%
  dplyr::select(x, y)

circularData <- rbind(c1, c2, c3) %>%
  mutate(Cluster = as.factor(LETTERS[kmeans(., 3)$cluster]))

ggplot(circularData) +
  geom_point(aes(x = x, y = y, colour = Cluster)) +
  theme_bw()

4) It is reasonable to expect that the cluster sizes are equal.

When the algorithm tries to minimise the distance of each point to the other members of its cluster it effectively prioritises the larger clusters; it views grouping the smaller clusters together as worthwhile if it allows it to split a larger cluster. You can see this in the image below, where we changed the sizes of clusters A, B, and C from the second assumption's example to 100, 45, and 20 respectively.

# varianceexplained.org/r/kmeans-free-lunch/

# k-Means is fairly robust against this in such well-defined cases; if you
# increase either of the small cluster sizes by 1, or decrease 100 by 1,
# then it finds a stable solution.

set.seed(314159265)
# Thin clusters B and C down to 45 and 20 points respectively, then re-cluster
unequal <- plottingData %>%
  group_by(Cluster) %>%
  filter(Type == 'Scaled') %>%
  mutate('Keep' = ifelse(Cluster == 'B', row_number() <= 45,
                  ifelse(Cluster == 'C', row_number() <= 20,
                                         row_number() <= 100))) %>%
  filter(Keep) %>%
  ungroup() %>%
  select(x, y) %>%
  mutate('Cluster' = c('B', 'C', 'A')[kmeans(., 3)$cluster],
         'Type' = 'Unequal Membership')

unequalPlot <- plottingData %>%
  filter(Type == 'Scaled') %>%
  mutate(Type = 'Equal Membership') %>%
  bind_rows(unequal) %>%
  mutate('Type' = factor(Type, levels = c('Unequal Membership', 'Equal Membership')))

ggplot(unequalPlot) +
  geom_point(aes(x = x, y = y, colour = Cluster)) +
  facet_wrap(. ~ Type, scales = 'free') +
  theme_bw()

You can see exactly this behaviour in the image above: the algorithm has grouped the two smaller clusters together so that it can split the larger cluster in two.

5) It is expected that there is a set of clusters within the data.

This is a fairly obvious assumption; however, when clustering with more than 2 columns it becomes hard to identify whether there are identifiable clusters within the data. We can get an idea using the following methods: Visual Assessment of cluster Tendency (VAT), Hopkins' statistic, and visual inspection using dimensionality reduction techniques. We shall demonstrate these using two simulated spherical data sets with 5 columns: one that we know is clusterable and one that we know is not.

# Simulate a 5-column data set with 3 well-separated spherical clusters,
# and a 5-column data set drawn from a single standard normal distribution
sizes <- c(100, 100, 100)
centers <- tibble(a = c(5, 10, 15),
                  b = c(5, 10, 15),
                  c = c(5, 10, 15),
                  d = c(5, 10, 15),
                  e = c(5, 10, 15),
                  n = sizes,
                  cluster = factor(1:3))
set.seed(314159265)
clusterable <- centers %>%
  group_by(cluster) %>%
  do(tibble(a = rnorm(.$n, .$a, 1),
            b = rnorm(.$n, .$b, 1),
            c = rnorm(.$n, .$c, 1),
            d = rnorm(.$n, .$d, 1),
            e = rnorm(.$n, .$e, 1))) %>%
  ungroup() %>%
  select(-cluster)

nonclusterable <- tibble(a = rnorm(300),
                         b = rnorm(300),
                         c = rnorm(300),
                         d = rnorm(300),
                         e = rnorm(300))

clusterable_clustered <- k_means(clusterable, seed = 3141)
nonclusterable_clustered <- k_means(nonclusterable, seed = 3141)
clusterable_assessment <- attr(clusterable_clustered, 'clusterability')
nonclusterable_assessment <- attr(nonclusterable_clustered, 'clusterability')

Dimensionality Reduction

In order to visually inspect data with more than 2 columns we need to use "dimensionality reduction", which creates a transformation of the data that reduces the number of columns. We can then choose the number of dimensions (columns) and plot them accordingly; in this case we will reduce our simulated spherical data from 5 dimensions to 2 dimensions.
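
As an illustrative sketch (not necessarily the method used by k_means() internally), the 5-column clusterable data simulated above can be reduced to 2 dimensions with t-SNE via the Rtsne package loaded at the top of this document:

# A sketch of reducing the 5-column 'clusterable' data to 2 dimensions with
# t-SNE; the reduction used by k_means() for its 'raw_plot' may differ.
set.seed(3141)
tsne_out <- Rtsne(as.matrix(clusterable), dims = 2, perplexity = 30,
                  check_duplicates = FALSE)
tibble(dim1 = tsne_out$Y[, 1], dim2 = tsne_out$Y[, 2]) %>%
  ggplot(aes(x = dim1, y = dim2)) +
  geom_point() +
  theme_bw()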

grid.arrange(
  attr(clusterable_clustered, 'raw_plot') + ggtitle('Clusterable'),
  attr(nonclusterable_clustered, 'raw_plot') + ggtitle('Not Clusterable'),
  layout_matrix = rbind(c(1, 2),
                        c(1, 2))
)

With the dimensionality reduction we can begin to identify patterns in the data: within this example we can see 3 clusters in the clusterable data set and no identifiable clusters in the other. As this simulated example is designed to show the method, the results are idealised and we are unlikely to obtain such clear results in a practical scenario.

Visual Assessment of cluster Tendency (VAT)

With the VAT we plot the dissimilarity of the data points to each other, with low dissimilarity (blue in the plots below) indicating that the points should be grouped together. Ideally we want to see low dissimilarity within a cluster and high dissimilarity (red in the plots below) between clusters.
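
As a sketch of how such a plot can be produced directly (the 'clusterability' attribute returned by k_means() may be generated differently), the factoextra package loaded above provides fviz_dist() for ordered dissimilarity images:

# A sketch of an ordered dissimilarity image for the 'clusterable' data using
# factoextra; the gradient is set so blue = low and red = high dissimilarity,
# matching the description above.
fviz_dist(get_dist(scale(clusterable)), order = TRUE, show_labels = FALSE,
          gradient = list(low = "blue", mid = "white", high = "red"))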

grid.arrange(
  clusterable_assessment + ggtitle('Ordered Dissimilarity Image - Clusterable'),
  nonclusterable_assessment + ggtitle('Ordered Dissimilarity Image - Not Clusterable'),
  layout_matrix = rbind(c(1, 2),
                        c(1, 2))
)

Again we are seeing idealised results in the VAT plots above: the clusterable data set shows a perfect distinction between clusters, whereas the not clusterable data set does not show any clusters.

Both the VAT and dimensionality reduction are also useful visual ways to identify a potential value of $k$ when we have data with high dimensionality (more than 2 columns) that we want to cluster.

Hopkins' statistic

Hopkins' statistic is a way to assess whether a data set is clusterable by comparing it against a uniform distribution, which is flat and does not have any identifiable clusters. The values of Hopkins' statistic, $H$, lie between 0 and 1, with a value of 0 indicating the data is not clusterable and a value of 1 indicating the data is clusterable.

Unlike a formal hypothesis test, we don't use a p-value to determine whether we reject the null hypothesis that the data comes from a uniform distribution; instead we use the rule of thumb that if $H$ is greater than 0.75 we can reject this at the 90% confidence level.
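
As a sketch, Hopkins' statistic can also be computed directly with factoextra's get_clust_tendency() (the value stored by k_means() in its 'clusterability' attribute may be calculated slightly differently, and some package versions report $1 - H$ rather than $H$):

# A sketch of computing Hopkins' statistic for both simulated data sets;
# n is the number of points sampled when computing the statistic.
set.seed(3141)
get_clust_tendency(scale(clusterable), n = 50, graph = FALSE)$hopkins_stat
get_clust_tendency(scale(nonclusterable), n = 50, graph = FALSE)$hopkins_stat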

knitr::kable(
  data.frame('Data' = c('Clusterable', 'Not Clusterable'),
             'H' = c(attr(clusterable_assessment, 'metrics')$hopkins,
                     attr(nonclusterable_assessment, 'metrics')$hopkins))
)

As we can see from the values of $H$, the clusterable data set is unlikely to be uniformly distributed and may contain obtainable clusters, whereas the other data set appears much closer to a uniform distribution. What we do see, however, is that the value of $H$ for the non-clusterable data set is still remarkably high, which is an example of why we should not take a single test as proof of clusterability.

When we look at these assessments together we begin to get an idea of whether the data is suitable for clustering; however, there may be cases where this can only be confirmed after the analysis, by interpreting the results back to the problem that we wish to solve with clustering.


These assumptions are easy to check in the examples provided above; however, as we saw with the last assumption, once we use 3 or more variables they become difficult to check. Whilst we can use methods to visualise the results in these cases, we can't necessarily use them to check the assumptions, as they don't always preserve the features of the data that we wish to check. This is why it is important to understand why an algorithm might not work as expected, and what the alternatives are, so we can try different approaches that may rectify the problems we are seeing.

How do we determine the "correct" number of clusters to use?

This is a difficult question to answer; as with most unsupervised machine learning algorithms there is no definitively correct answer. However, there are two main methods that we can use to identify the optimal number of clusters given the data.

1) Elbow method

The "elbow" method plots how well our solution describes the data and looks for the value where increasing the number of clusters does not provide a noticeable improvement. This is because with a smaller number of clusters each increase will provide a greater improvement and this improvement will reduce as the number of clusters increases. Within the plot we look for the value where there is an abrupt change or flattening of this curve (inflection) thereby forming an \"elbow\" in the plot.

2) Silhouette method

We can also identify an optimal value of $k$ using the average silhouette value for a range of clustering solutions. The silhouette value is a measure, calculated for each data point, that compares the point's similarity to the cluster it is a member of with its similarity to the next nearest cluster, resulting in a value between -1 and 1. A silhouette value close to 1 indicates that the data point is similar to the other data points within its cluster, suggesting a suitable assignment; conversely, a silhouette value close to -1 indicates that the data point is more similar to the data points in another cluster, suggesting a poor assignment to its current cluster. When we take the average silhouette value across all data points we get an indication of the overall suitability of our clustering solution, with higher values indicating a better description of the data. We can therefore identify the optimum number of clusters by calculating the average silhouette value for a range of possible solutions and choosing the solution with the highest average silhouette value. A short code sketch of both selection plots follows below.
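
Formally, for a data point $i$ with average distance $a(i)$ to the other members of its own cluster and average distance $b(i)$ to the members of the nearest other cluster, the silhouette value is $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$. As a sketch of both selection plots (an alternative to the k_means() helper used in the demonstrations that follow), factoextra's fviz_nbclust() can be applied to the simulated data from the scaling example above:

# A sketch of the elbow (within-cluster sum of squares) and silhouette plots
# for the two-column simulated data created in the scaling example.
fviz_nbclust(scale(data), kmeans, method = "wss") +
  ggtitle("Elbow Method")
fviz_nbclust(scale(data), kmeans, method = "silhouette") +
  ggtitle("Silhouette Method")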

We shall now demonstrate each method using simulated and real data.

Cluster Selection on Simulated Data

We can see visually that this data contains 5 clusters, and both the elbow and silhouette methods identify this correctly, with a clear and distinct inflection and peak respectively.

sizes <- c(100, 100, 100, 100, 100)
centers <- tibble(x = c(1, 4, 6, 9, 11.5),
                  y = c(5, 1.5, 8, 3, 10) * 10,
                  n = sizes,
                  cluster = factor(1:5))
set.seed(314159265)
data <- centers %>%
  group_by(cluster) %>%
  do(tibble(x = rnorm(.$n, .$x, 1), y = rnorm(.$n, .$y, 6))) %>%
  ungroup() %>%
  select(-cluster)

elbow <- k_means(data, seed = 3141, optimal_k_method = 'elbow')
silhouette <- k_means(data, seed = 3141, optimal_k_method = 'silhouette')

grid.arrange(
  ggplot(data) +
    geom_point(aes(x = x, y = y)) +
    theme_bw() +
    ggtitle('Data'),
  attr(elbow, 'k_selection'),
  attr(silhouette, 'k_selection'),
  attr(elbow, 'plot') +
    ggtitle('Cluster Visualisation'),
  layout_matrix = rbind(c(1, 1),
                        c(2, 3),
                        c(4, 4))
)

We created this data set to show clear examples of the "elbow" and the maximum silhouette value; however, in real cases this is not always so clear. Let's use a real data set where the elbow is more difficult to identify.

Cluster Selection on Real Data

Let us demonstrate an example using real data where the $k$ selection methods are not so clear cut; we shall use the iris data set, commonly used in clustering tutorials. In the iris data set the sepal and petal dimensions of 3 species of iris are recorded, with the aim of seeing whether we can determine the species of iris from these dimensions.

The previous example used 2-dimensional data (i.e. data with 2 columns), where we could also use visual inspection to gain insight into the optimal value of $k$. Let us now use all 4 measurements to determine how many clusters we should use, which means we cannot easily inspect the data visually prior to clustering.

In order to visually inspect the data we again need to use "dimensionality reduction", which, as discussed earlier, creates a transformation of the data that reduces the number of columns. Whilst the reduced data could itself be used in a clustering, it is not advised: through this process we can lose information that is useful when constructing our clusters, or the data may no longer be suitable for a method such as k-Means because it no longer meets the criteria outlined above, particularly the spherical data criterion. It can, however, be useful for visually identifying the number of clusters that we may want to create.
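
As a sketch of such a reduction used purely for visual inspection (the k_means() helper's 'raw_plot' may use a different technique), the 4 iris measurements can be projected onto 2 dimensions with principal component analysis:

# A sketch of reducing the 4 iris measurements to 2 dimensions with PCA,
# used here only to eyeball possible groupings before clustering.
pca <- prcomp(iris[, -5], scale. = TRUE)
tibble(dim1 = pca$x[, 1], dim2 = pca$x[, 2]) %>%
  ggplot(aes(x = dim1, y = dim2)) +
  geom_point() +
  theme_bw()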

In the figure below we can see that whilst the elbow method selects the known number of clusters in this example, the inflection point is much less severe and harder to identify; we also see that the silhouette method struggles to identify the actual number of clusters.

elbow <- k_means(iris[, -5], seed = 3141, optimal_k_method = 'elbow')
silhouette <- k_means(iris[, -5], seed = 3141, optimal_k_method = 'silhouette')

grid.arrange(
  attr(elbow, 'raw_plot'),
  attr(elbow, 'k_selection'),
  attr(silhouette, 'k_selection'),
  attr(elbow, 'plot'),
  attr(silhouette, 'plot'),
  layout_matrix = rbind(c(1, 1),
                        c(2, 3),
                        c(4, 5))
)

This example shows that clustering requires a level of analytical judgement and cannot be fully automated. It also shows the difficulty of dealing with more than 2 columns in a clustering: whilst we can visualise the results through dimensionality reduction, we cannot easily tell which columns, or combination of columns, are driving the formation of the clusters.

Choice of seed value

At the start of the algorithm a random set of starting points is chosen; however, with computers there is no such thing as truly random. Computers generate "pseudo-random" numbers using another algorithm, and we can generate the same set of random numbers by providing the same starting point, or "seed".

In algorithms that use randomly generated starting points, such as the one used here, we can set a "seed" value so our results remain consistent across multiple sessions. Without this we would most likely get different results each time, meaning that any work based on these results would need to be redone each time the algorithm is run, which is impractical and is why it is advisable to set a seed. We can, however, use this to our advantage: by changing the seed we can generate different results.
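
A minimal sketch of how a seed controls the pseudo-random numbers generated:

# The same seed reproduces the same 'random' draws; a different seed gives
# different ones.
set.seed(3141); rnorm(3)
set.seed(3141); rnorm(3)   # identical to the draws above
set.seed(42);   rnorm(3)   # different draws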

Algorithms keep running until either a stopping condition is met or they have performed a certain number of iterations. These stopping conditions typically involve either maximising or minimising a predetermined value, and this value can have many maximum or minimum values (maxima/minima) depending upon where the algorithm starts and how long it runs. Ideally we want to find what is known as the "global" maximum/minimum, which is the best solution; however, algorithms can get stuck in what are known as "local" maxima/minima, which are optimal in their neighbourhood but not the most optimal overall. For example, if an explorer started in the UK and wanted to find the highest point reachable without crossing any water it would be Ben Nevis; if they started in Belgium it would be Mount Everest. Both are correct given where they started and the criteria they were looking for: Ben Nevis is a local maximum and Mount Everest is the global maximum. A small change in starting point can therefore have a significant impact on the results.
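
As a sketch of local versus global optima in practice (using base R's kmeans() rather than the k_means() helper), restarting the algorithm from many random starting points can avoid the poorer local solution that a single start may settle on:

# A sketch comparing a single random start with many restarts; the total
# within-cluster sum of squares from a single start may be higher (worse),
# depending on the seed used.
set.seed(2)
single_start <- kmeans(scale(iris[, -5]), centers = 3, nstart = 1)
many_starts  <- kmeans(scale(iris[, -5]), centers = 3, nstart = 50)
c(single = single_start$tot.withinss, many = many_starts$tot.withinss)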

Let us now see the effect of this using the iris data from the previous example; we will use 2 different seeds and see the effect this has on the generated solution. Within the plots below you can see the different results that we get: both our dimensionality reduction and our clustering use random starting points, and so the results differ. Using the same data, the same clustering method, and the same method to select $k$, we get two very different solutions and visualisations.

seed1 <- k_means(iris[, -5], seed = 3141, optimal_k_method = 'elbow')
seed2 <- k_means(iris[, -5], seed = 314159, optimal_k_method = 'elbow')

grid.arrange(
  attr(seed1, 'k_selection'),
  attr(seed1, 'plot') +
    ggtitle('Cluster Visualisation', subtitle = 'Seed = 3141'),
  attr(seed2, 'k_selection'),
  attr(seed2, 'plot') +
    ggtitle('Cluster Visualisation', subtitle = 'Seed = 314159'),
  layout_matrix = rbind(c(1, 2),
                        c(3, 4))
)

There is no way to determine the "ideal" seed value; it is purely a case of trial and error. If multiple seeds are investigated and there is no satisfactory result, then other methods should be investigated.


